Detecting groups of similar components in complex networks
We study how to detect groups in a complex network each of which consists of component nodes
sharing a similar connection pattern. Based on the mixture models and the exploratory analysis set 
up by Newman and Leicht (Newman and Leicht 2007 Proc. Natl. Acad. Sci. USA 104 9564), we 
develop an algorithm that is applicable to a network with any degree distribution. The partition of 
a network suggested by this algorithm also applies to its complementary network. In general, groups 
of similar components are not necessarily identical with the communities in a community network; 
thus partitioning a network into groups of similar components provides additional information about
the network structure. The proposed algorithm can also be used for community detection when the 
groups and the communities overlap. By introducing a tunable parameter that controls the involved 
effects of the heterogeneity, we can also investigate conveniently how the group structure can be 
coupled with the heterogeneity characteristics. In particular, an interesting example shows a group 
partition can evolve into a community partition in some situations when the involved heterogeneity 
effects are tuned. The extension of this algorithm to weighted networks is discussed as well. 

PACS numbers: 89.75.Hc, 89.75.Fb, 05.45.-a 



I. INTRODUCTION 

As a concise abstract model, the concept of network 
captures the most essential ingredients of a complex sys- 
tem, namely, its basic component units and their interac- 
tion configuration. This advantage — simple in form but 
powerful in modelling — has attracted intensive stud- 
ies of complex networks in a wide spectrum of contexts, 
ranging from natural sciences to engineering problems 
and human societies [1], [2], [3]. Roughly speaking, the
investigations mainly fall into two categories: seeking 
the topological characteristics and their origins in one 
and understanding how they interact with the dynami- 
cal processes supported by the networks in the other. It 
has been found that topological characteristics, such as 
small-world [4] and scale-free [5] properties, are quite general;
they are common features in a large set of networks
from various fields. Moreover, they are closely related to 
the dynamical processes on the networks. Illuminating 
examples among many others include epidemic spread- 
ing, to which the surprising implications of the scale-free 
property have been well illustrated @, 0|; and network 
synchronization, where the role played by the topology 
can be marvellously separated and appreciated by ana- 
lyzing the master stability function Q . Such progress has 
greatly enhanced our belief in the significance of identifi- 
cation and detection of these important topological char- 
acteristics [H, H, H|. 

Community is another common topological feature 
that exists in many complex networks. Intuitively, a 
community refers to a set of nodes whose connections 
between themselves are denser than their connections to 
the nodes outside the set H US EH, El- Community 
detection is very important in network studies, because 



communities usually govern certain functions as seen in 
many biochemical networks [13] and social networks [3].
Communities also have important implications to the dy- 
namical processes based on the networks, such as syn- 
chronization [TEl [lfjl . IT71 . [l8l |. percolation and diffusion 
[19], [20], [21], [22]. In addition, in networks of large size,
community structure may serve as a crucial guide for re- 
ducing the network, which is believed to be helpful in 
shedding light on the most essential properties of a complex
system [23], [24]. In view of the importance of the
community structure, there have been a lot of studies 
devoted to the issue of community detection. (See Ref.
[25] for a recent and comprehensive review.) Recently,
attempts have also been made to extend the community 
detection methods developed in these studies to weighted 
networks [26], [27] and directed networks [28], [29].

However, community is not the only perspective for 
partitioning a network. For example, in a bipartite net- 
work, the best justified partition is to separate all the 
nodes into two groups such that nodes in one group only 
link to the nodes in the other. Indeed, partition perspectives
other than that of community are necessary in order
to have a better understanding of both the structures 
of complex networks and the dynamical processes they 
support, as shown in [30] by the study of synchronous
motions on bipartite networks. 

An insightful idea is to partition a network into groups 
where nodes in each group share a similar connection 
pattern. As the connection patterns are various and can 
vary from group to group, this group model is very gen- 
eral and powerful in representing many different types of 
structures in a network. This idea has a long history. 
It was first introduced in social science by Lorrain and
White, where the nodes of similar connection pattern
are referred to as being structurally equivalent. This
idea has fruitfully led to the analysis of networks in social
[32] and computer science based on block modelling. A
recent review can be found in Ref. [33].

In a recent study [34], Newman and Leicht came up
with a novel and general partition scheme based on this 
idea. It divides a network into groups of similar con- 
nection pattern. The most striking advantage of their 
scheme lies in that it can be applied for seeking a very 
broad range of types of structures in networks without 
any prior knowledge of the structures to be detected. In 
addition, the algorithm thus developed is ready to be 
used for both the directed and undirected networks, and 
it is straightforward to generalize it to analyze weighted 
networks [35]. The efficiency of the algorithm is also high
in terms of computational complexity. Recently, Ramasco
and Mungan [36] have analyzed this method in detail
and devised a generalized Newman and Leicht algorithm 
based on their study. Other than the Newman and Leicht
algorithm and its variant [36], another intriguing and insightful
scheme for partitioning a network into groups of
similar connection pattern has also been developed based
on information theory [37].

The Newman and Leicht theory assumes that in a
group the total outgoing degree must be larger than zero
[36]. This assumption limits the application of their theory.
In order to overcome this limitation, it has been
suggested in [36] to deal with the incoming degrees, outgoing
degrees, and bidirectional degrees separately. In
this paper, we show that by assuming that all nodes in a
group share the same a priori probability to connect unidirectionally
to a given node (see analysis in Sec. III),
this problem can be solved straightforwardly. The algo- 
rithm we develop based on this assumption can be ap- 
plied without any restriction on the degree distribution. 
Moreover, the partition of a network given by our al- 
gorithm can be shown to be exactly the same as that 
of its complementary network (see Sec. III). This is re- 
quired by the definition of a group of similar connection 
pattern. Another advantage of our algorithm is that it
allows an analysis of the heterogeneity effects, which reveals
further useful information about the network structure.
In addition to all of these, our algorithm shows clearly 
that it is the information whether there is a link between 
two given nodes, rather than the link exclusively (if it 
exists between the two nodes), that contributes to the 
partition. The information that there is no link between 
two given nodes is equally important. This insight pro- 
vides a new and different view for partitioning weighted 
networks. Our algorithm also inherits all the advantages 
of that by Newman and Leicht. 

In the next section, we first review briefly the theory 
by Newman and Leicht, and then point out the extent of 
its applicability. Next, in Sec. III, we develop our algorithm
based on the a priori probability assumption and
discuss its properties. After that we present examples of 
various types of groups together with the analysis of two 
real networks. We discuss in Sec. IV the role played by 



the involved heterogeneity effects, and show how a group 
partition can depend on it by the example of the karate 
network [38]. Finally, before summarizing the results of
this paper, we discuss in Sec. V how to extend our algo- 
rithm to weighted networks. 



II. THE NEWMAN-LEICHT ALGORITHM 
(NLA) 

In search of the structures in a network, a dilemma we
often encounter is that we have to specify in advance what
structures we intend to look for, but this information
is usually unavailable before the structures
have been found. As a result, what we
can find eventually may strongly depend on whether we
have enough prior knowledge of the structures to be detected.
To overcome this difficulty, Newman and Leicht
[34] insightfully focused on the groups of similar connection
pattern. In their theory, the connection pattern
for a group is specified by sets of parameters to
be determined. Initially, the information of these con- 
nection patterns is not required as input to the search 
algorithm thus designed; rather, they are shaped up dur- 
ing the search process (running of the algorithm) and 
produced as outputs. Finally, what the algorithm pro- 
vides simultaneously is not only the best way for group- 
ing the nodes, but also the common connection pattern 
that nodes in each group share. They made this possible 
by skillfully harnessing the probabilistic mixture models 
and the expectation-maximization algorithm [34]. As the
groups of similar connection pattern are effective in mod- 
elling various structures in networks, their algorithm is 
very general and has a wide application spectrum. 

The main points of the Newman and Leicht theory are 
as follows. (For the sake of convenience and clarity, we 
take the same notation as in [34] throughout this paper.)
Let us consider a network of n nodes belonging to
c groups. Its connection configuration is given by the adjacency
matrix A. If there is a link between node i and
node j then $A_{ij} = 1$; otherwise $A_{ij} = 0$. In the Newman
and Leicht theory, n, c and A are assumed to be known 
and used as the input for their algorithm. Here the num- 
ber of groups c is the only information needed in advance 
about the partition. If it is unavailable, it should be as- 
sumed or estimated based on other known information of 
the network. 

Next, the connection configuration A is assumed to
be a realization of an underlying statistical model defined
by two sets of probabilities denoted by $\pi = \{\pi_r\}$
and $\theta = \{\theta_{rj}\}$, respectively, with $r = 1, \cdots, c$ and
$j = 1, \cdots, n$. This statistical model assumes that each
node has probability $\pi_r$ to fall in a group r and that all
nodes in that group have the same probability,
closely related to $\theta_{rj}$, to connect to a given node j.
Here $\theta_{rj}$ is equivalent to the portion of the outgoing links
of group r that connect to node j. The outgoing links
of group r refer to the outgoing links that all nodes in
group r have.

In this sense $\theta_r = \{\theta_{rj}, j = 1, \cdots, n\}$ defines the connection
pattern shared by all nodes in group r. As long
as $\pi$ and $\theta$ are known, together with the adjacency matrix
A as measured data, one can obtain the probability
for observing the node i being in the group r, namely
$q_{ir} = \Pr(g_i = r | A, \pi, \theta)$, and thus all the information
about the group partition. Here $g_i$ represents the group
to which the node i is regarded to belong in a certain
partition; we use q and g to denote $\{q_{ir}\}$ and $\{g_i\}$,
respectively.

Hence the key is to specify $\pi$ and $\theta$. Newman and
Leicht assumed that the right values of the elements of $\pi$
and $\theta$ are those that maximize the likelihood to observe
the connection configuration A and a certain partition g,
namely $\Pr(A, g | \pi, \theta)$, or equivalently those that maximize
its logarithm

$\mathcal{L} = \ln \Pr(A, g | \pi, \theta)$. (2.1)

In this way, the problem is converted to a solvable model-fitting
problem with the help of the maximum likelihood
method [3J]. The next task is then reduced to finding $\pi$ and
$\theta$ that satisfy this requirement.

To proceed further, Newman and Leicht adopted a cru- 
cial simplification: they suggested instead to maximize 
the averaged L over all possible partitions: 




$\bar{\mathcal{L}} = \sum_{g_1=1}^{c} \cdots \sum_{g_n=1}^{c} \Pr(g | A, \pi, \theta) \ln \Pr(A, g | \pi, \theta)$. (2.2)

As $\{g_i\}$ are summed out, this simplification allows one to
write down analytically the solutions of $\pi$ and $\theta$ in terms
of A and q, and develop an efficient iterative algorithm
based on them. In detail, starting from

$\Pr(A | g, \pi, \theta) = \prod_{ij} \theta_{g_i,j}^{A_{ij}}$ (2.3)

and

$\Pr(g | \pi, \theta) = \prod_i \pi_{g_i}$, (2.4)

Newman and Leicht obtained

$\Pr(A, g | \pi, \theta) = \prod_i \pi_{g_i} \prod_j \theta_{g_i,j}^{A_{ij}}$ (2.5)

and

$\bar{\mathcal{L}} = \sum_{i,r} q_{ir} [\ln \pi_r + \sum_j A_{ij} \ln \theta_{rj}]$ (2.6)

with

$q_{ir} = \frac{\pi_r \prod_j \theta_{rj}^{A_{ij}}}{\sum_s \pi_s \prod_j \theta_{sj}^{A_{ij}}}$. (2.7)

Then $\pi$ and $\theta$ that maximize $\bar{\mathcal{L}}$ were deduced in terms of
A and q as

$\pi_r = \frac{1}{n} \sum_i q_{ir}$, (2.8)

$\theta_{rj} = \frac{\sum_i A_{ij} q_{ir}}{\sum_i k_i q_{ir}}$, (2.9)

where $k_i = \sum_j A_{ij}$ denotes the outgoing degree of node
i. Eqs. (2.7), (2.8) and (2.9) thus define the Newman-Leicht
algorithm (NLA). It runs in an iterative way: at
each step, the old values of the elements of q, $\pi$ and $\theta$ are
substituted into the right hand side of these equations
to generate their updated values. The convergent result
of $\theta$ then defines the connection patterns of groups and
that of q suggests grouping. In practice, the calculation
converges rapidly. (We found that the convergence time
goes as $\sim O(n^2)$ in all the networks we have analyzed
with the NLA, including those that are not presented in
this paper.)

FIG. 1: Two examples where the Newman and Leicht algorithm
(NLA) does not apply. According to the definition, of
the two groups (of similar connection pattern) in the left network
(a) [36] one contains the left two nodes and another contains
the right two; and of the two groups in the right network
(b) one consists of the center node and another consists of the
rest. However, due to the fact that one group, of the right
two nodes in (a) and the peripheral nodes in (b), has no outgoing
links, the Newman and Leicht algorithm (NLA) fails to
partition them correctly. As a comparison the APBEMA has
no restriction on the degree distribution; it partitions these
two networks without any ambiguity.

It should be noted that in getting Eqs. (2.8) and (2.9)
the following constraints imposed on $\pi$ and $\theta$ have been
taken into consideration:

$\sum_r \pi_r = 1$ (2.10)

and

$\sum_j \theta_{rj} = 1$. (2.11)

Indeed, the results given by Eqs. (2.8) and (2.9) satisfy
these requirements. In addition, the results of Eq. (2.8)
and Eq. (2.9) are consistent with the definitions of
$\pi_r$ and $\theta_{rj}$. In particular, Eq. (2.9) makes it clear that
$\theta_{rj}$ is the expected portion of the outgoing links of group
r that connect to node j.
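
As an illustration of how the NLA iteration could be coded, the following is a minimal sketch using only the Python standard library. The function name `nla_step`, the list-of-lists representation of A and q, and the choice to raise an error when a group has no outgoing links are assumptions made for this sketch; they are not taken from Ref. [34].

```python
import math


def nla_step(A, q):
    """One NLA iteration on an n x n 0/1 matrix A with memberships q (n x c)."""
    n, c = len(A), len(q[0])
    k = [sum(row) for row in A]  # outgoing degrees k_i
    # Eq. (2.8): pi_r = (1/n) sum_i q_ir
    pi = [sum(q[i][r] for i in range(n)) / n for r in range(c)]
    # Eq. (2.9): theta_rj = sum_i A_ij q_ir / sum_i k_i q_ir
    theta = []
    for r in range(c):
        denom = sum(k[i] * q[i][r] for i in range(n))
        if denom == 0:
            # assumption of this sketch: stop instead of dividing by zero
            raise ZeroDivisionError("a group without outgoing links breaks Eq. (2.9)")
        theta.append([sum(A[i][j] * q[i][r] for i in range(n)) / denom
                      for j in range(n)])
    # Eq. (2.7): q_ir proportional to pi_r * prod_j theta_rj^A_ij, done in log space
    new_q = []
    for i in range(n):
        logw = []
        for r in range(c):
            terms = [pi[r]] + [theta[r][j] for j in range(n) if A[i][j]]
            logw.append(sum(math.log(t) for t in terms) if min(terms) > 0 else -math.inf)
        top = max(logw)
        w = [math.exp(x - top) for x in logw]
        total = sum(w)
        new_q.append([x / total for x in w])
    return pi, theta, new_q
```

Calling `nla_step` repeatedly, feeding the returned memberships back in, corresponds to the iteration described above. Working with logarithms is only a numerical convenience and does not change the update equations.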

The definition of $\theta_{rj}$ and the corresponding normalization
condition imposed by Eq. (2.11) imply that the
partition given by the NLA must be such that each group
has at least one outgoing link [36]. This constraint limits
the application range of the NLA. An example cited
in [36] (see Fig. 2 in [36]) is a directed bipartite network
which is reproduced in Fig. 1(a). According to the
definition of a group of similar connection pattern, this
network should be partitioned into two groups such that
one contains the left two nodes and the other contains the right
two nodes. However, as the right group has
no outgoing links, the NLA would suggest instead a partition
into the upper two nodes and the lower two nodes, or the
whole network as a single group [36]. Another example
is the directed star shown in Fig. 1(b); the NLA partitions
all nodes into one group, though from the viewpoint
of similar connection pattern or symmetry we expect the
center node to be in one group and the peripheral nodes
in another.
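
The limitation can be made concrete with a small sketch. Assuming a directed star of five nodes whose center (node 0) links to every peripheral node, and assuming the expected hard partition as the membership q, the denominator of Eq. (2.9) vanishes for the peripheral group. The helper name `theta_denominators` is hypothetical.

```python
def theta_denominators(A, q):
    """Denominator of Eq. (2.9), sum_i k_i q_ir, for each group r."""
    k = [sum(row) for row in A]  # outgoing degrees k_i
    c = len(q[0])
    return [sum(k[i] * q[i][r] for i in range(len(A))) for r in range(c)]


n = 5
# directed star: node 0 links to nodes 1..4, which have no outgoing links
star = [[1 if (i == 0 and j != 0) else 0 for j in range(n)] for i in range(n)]
# expected partition: center in group 0, peripheral nodes in group 1
expected_q = [[1.0, 0.0]] + [[0.0, 1.0] for _ in range(n - 1)]
denominators = theta_denominators(star, expected_q)
```

For the peripheral group the sum of $k_i q_{ir}$ is zero, so $\theta_{rj}$ cannot be normalized as Eq. (2.11) requires, and the expected partition is not an admissible solution of the NLA.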



III. A PRIORI PROBABILITY BASED 
EXPECTATION MAXIMIZATION ALGORITHM 
(APBEMA) 

In this section we present an expectation maximiza- 
tion algorithm that does not have any restriction on the 
degree distribution of a group. In addition, it also has 
many other advantages which will be discussed in the fol- 
lowing sections. Our method is in the same spirit as the 
NLA, but the statistical model of the group is different. 

First let us suppose the network under consideration 
has n nodes that belong to c groups, and the connection 
configuration is given by the adjacency matrix A. Simi- 
larly, we assume n, c and A are known and serve as the 
input. 

Next, as in the NLA, we assume that each node has
probability $\pi_r$ to fall in group r. $\pi_r$ in effect reflects the
size of group r, which is expected to be $n\pi_r$. As any node
must be in the network, we have

$\sum_r \pi_r = 1$. (3.1)



However, to specify the connection pattern of a group,
we take the a priori probability assumption instead. We
assume that in a given group r all its nodes share the
same a priori probability, denoted by $p_{rj}$, to connect
unidirectionally to a given node j. As such $p_{rj}$ should
satisfy $0 \le p_{rj} \le 1$. We also assume that $p_{ri}$ is independent
of $p_{rj}$ for $i \ne j$; namely, the probabilities for a node
(in group r) to connect to two different nodes are completely
independent. The normalization condition for $p_{rj}$
can be expressed as $p_{rj} + (1 - p_{rj}) = 1$, where $(1 - p_{rj})$
stands for the probability with which a node in group r
does not connect to node j. As compared with the NLA,
here we need not introduce a normalization condition like
Eq. (2.11); $p_{rj}$ can take any allowed value ($0 \le p_{rj} \le 1$)
independently. It is this flexibility and adaptability that
makes our algorithm applicable in principle to any network.

Now we follow the NLA to develop the algorithm based
on $\pi = \{\pi_r\}$ and $p = \{p_{rj}\}$. In order to introduce fewer
notations, here we take all other symbols adopted in the
NLA except $\theta$ and maintain their original meaning (with
$\theta$ being replaced by p where necessary). We also refer
to our algorithm as the a priori probability based expectation
maximization algorithm (APBEMA) in the following.
Our starting point is the conditional probabilities

$\Pr(g | \pi, p) = \prod_i \pi_{g_i}$ (3.2)

and

$\Pr(A | g, \pi, p) = \prod_{ij} p_{g_i,j}^{A_{ij}} (1 - p_{g_i,j})^{1 - A_{ij}}$. (3.3)

It should be stressed that the right hand side of Eq.
(3.3) accounts for not only the probability for the presence
of a link ($A_{ij} = 1$) but also that for a null link
($A_{ij} = 0$), hence faithfully reflects the conditional probability
for observing the configuration given by A. As can
be seen in the following, it also implies the null links are
as important as links for partitioning a network,
which agrees well with our intuition.
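
To make the role of null links explicit, the logarithm of Eq. (3.3) can be evaluated for a fixed assignment g. The sketch below is illustrative only; the function name `log_prob_A_given_g` is hypothetical, and it assumes every $p_{rj}$ lies strictly between 0 and 1 so that both logarithms are finite.

```python
import math


def log_prob_A_given_g(A, g, p):
    """ln Pr(A | g, pi, p) from Eq. (3.3); assumes 0 < p[r][j] < 1."""
    total = 0.0
    n = len(A)
    for i in range(n):
        for j in range(n):
            prj = p[g[i]][j]
            # a link contributes ln p, a null link contributes ln(1 - p)
            total += math.log(prj) if A[i][j] else math.log(1.0 - prj)
    return total
```

Every pair (i, j) contributes a term whether or not a link is present, which is how the information carried by null links enters the likelihood.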

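Our next task is to find $\pi$ and p that maximize

$\bar{\mathcal{L}} = \sum_{g_1=1}^{c} \cdots \sum_{g_n=1}^{c} \Pr(g | A, \pi, p) \ln \Pr(A, g | \pi, p)$. (3.4)

It can be rewritten as

$\bar{\mathcal{L}} = \sum_{i,r} q_{ir} [\ln \pi_r + \sum_j A_{ij} \ln p_{rj} + \sum_j (1 - A_{ij}) \ln(1 - p_{rj})]$ (3.5)

if we substitute Eqs. (3.2) and (3.3) into Eq. (3.4) with

$q_{ir} = \frac{\pi_r \prod_j p_{rj}^{A_{ij}} (1 - p_{rj})^{1 - A_{ij}}}{\sum_s \pi_s \prod_j p_{sj}^{A_{ij}} (1 - p_{sj})^{1 - A_{ij}}}$. (3.6)

Here $q_{ir} = \Pr(g_i = r | A, \pi, p)$. Apparently, it satisfies the
normalization condition $\sum_r q_{ir} = 1$ as required.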

Now we are ready to obtain $\pi$ and p that maximize $\bar{\mathcal{L}}$
with the only constraint $\sum_r \pi_r = 1$. We set

$f(\pi, p, \alpha) = \bar{\mathcal{L}} - \alpha (\sum_r \pi_r - 1)$ (3.7)

with $\bar{\mathcal{L}}$ being given by Eq. (3.5) and $\alpha$ the Lagrange
multiplier introduced. By solving the following equations

$\frac{\partial f}{\partial \alpha} = 0$, $\frac{\partial f}{\partial \pi_r} = 0$, and $\frac{\partial f}{\partial p_{rj}} = 0$, (3.8)

we obtain

$\pi_r = \frac{1}{n} \sum_i q_{ir}$ (3.9)

and

$p_{rj} = \frac{\sum_i A_{ij} q_{ir}}{\sum_i q_{ir}}$. (3.10)

Then we get the APBEMA defined by Eqs. (3.6), (3.9)
and (3.10). Its iterative implementation is the same as
that for the NLA, hence it has the same efficiency in
terms of computational complexity. Also as in the NLA,
the convergent values of $\{q_{ir}\}$ suggest the partition, and
those of $\{p_{rj}\}$ describe the connection patterns of groups.
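
A minimal sketch of how the APBEMA iteration could be implemented is given below, using only the Python standard library. Several details are assumptions of this sketch rather than part of the algorithm as stated: the names `apbema_step` and `apbema`, the random initialization of q, the fixed number of iterations, the value 0.5 given to an empty group, and the clipping constant `EPS` that keeps the logarithms finite when $p_{rj}$ reaches 0 or 1.

```python
import math
import random

EPS = 2.0 ** -40  # assumption: clip p_rj into [EPS, 1 - EPS] to keep logs finite


def apbema_step(A, q):
    """One iteration: Eq. (3.9), Eq. (3.10), then Eq. (3.6). Returns (pi, p, new_q)."""
    n, c = len(A), len(q[0])
    mass = [sum(q[i][r] for i in range(n)) for r in range(c)]  # sum_i q_ir
    pi = [m / n for m in mass]  # Eq. (3.9)
    p = []
    for r in range(c):
        row = []
        for j in range(n):
            # Eq. (3.10); an empty group gets the value 0.5 (assumption)
            x = sum(A[i][j] * q[i][r] for i in range(n)) / mass[r] if mass[r] > 0 else 0.5
            row.append(min(1.0 - EPS, max(EPS, x)))
        p.append(row)
    new_q = []
    for i in range(n):
        logw = []
        for r in range(c):
            s = math.log(pi[r]) if pi[r] > 0 else -math.inf
            for j in range(n):
                s += math.log(p[r][j]) if A[i][j] else math.log(1.0 - p[r][j])
            logw.append(s)
        top = max(logw)
        w = [math.exp(x - top) for x in logw]
        total = sum(w)
        new_q.append([x / total for x in w])  # Eq. (3.6)
    return pi, p, new_q


def apbema(A, c, iters=100, seed=0):
    """Iterate from a random membership matrix (initialization is an assumption)."""
    rng = random.Random(seed)
    q = []
    for _ in range(len(A)):
        row = [rng.random() + 0.1 for _ in range(c)]
        q.append([x / sum(row) for x in row])
    pi, p = None, None
    for _ in range(iters):
        pi, p, q = apbema_step(A, q)
    return pi, p, q
```

In practice one would probably run several random initializations and keep the result with the largest $\bar{\mathcal{L}}$, since an expectation maximization iteration can stop at a local maximum.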

It is worthwhile noting that according to Eq. (3.10)
$0 \le p_{rj} \le 1$ as expected. In addition, Eq. (3.10) is consistent
with the meaning of $p_{rj}$, namely, the probability
with which a node in group r is unidirectionally linked
to node j. This can be seen further from $\sum_j p_{rj}$, which
represents the averaged outgoing degree a node in group
r has. Indeed, according to Eq. (3.10)

$\sum_j p_{rj} = \frac{\sum_i k_i q_{ir}}{\sum_i q_{ir}}$. (3.11)

($k_i = \sum_j A_{ij}$ is the outgoing degree of node i.) The right
hand side of Eq. (3.11) is exactly the expected outgoing
degree of a node in group r.
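
The identity in Eq. (3.11) is easy to check numerically. The sketch below computes both sides for one group r from an arbitrary membership matrix; the function name `mean_out_degree_two_ways` is hypothetical and the group is assumed to be non-empty.

```python
def mean_out_degree_two_ways(A, q, r):
    """Return (sum_j p_rj, sum_i k_i q_ir / sum_i q_ir) for group r."""
    n = len(A)
    mass = sum(q[i][r] for i in range(n))  # sum_i q_ir, assumed positive
    # Eq. (3.10) for every target node j
    p_r = [sum(A[i][j] * q[i][r] for i in range(n)) / mass for j in range(n)]
    lhs = sum(p_r)
    k = [sum(row) for row in A]  # outgoing degrees k_i
    rhs = sum(k[i] * q[i][r] for i in range(n)) / mass
    return lhs, rhs
```

The two returned numbers should agree up to rounding for any 0/1 matrix A and any q with a non-empty group r.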

To summarize, our algorithm is based on the a priori
probability assumption. It is this difference in the meaning
between $p_{rj}$ and $\theta_{rj}$ that makes the APBEMA radically
different from the NLA despite their similarity in
form.



A. Properties of the APBEMA 

The APBEMA developed previously has the following 
properties: 

(i) Applicable without any restriction on the degree distribution.
Even in the trivial and less meaningful example
where the network contains some isolated nodes the
APBEMA can successfully assign them into one group,
say group r, that is characterized by $p_{rj} = 0$. For the examples
shown in Fig. 1, the APBEMA partitions them
without any ambiguity in the sense that the output values
of $p_{rj}$ and $q_{ir}$ are all virtually zero or one. For the
directed bipartite network shown in Fig. 1(a) it suggests
the left two nodes in one group and the right two in another,
while for the directed star (Fig. 1(b)) it separates
the center node from the rest just as expected. (To apply
the APBEMA to these two networks, the number of
groups has been assumed to be c = 2.)

(ii) Suggesting the same partition for the complementary
network. By the complementary network of a network
specified by the adjacency matrix A, we mean the
network which has the same nodes but whose adjacency matrix
A' is related to A via $A'_{ij} = 1 - A_{ij}$. Namely, a link
in network A is a null link in its complementary network
A' and vice versa. Obviously, a group r in A characterized
by $\{p_{rj}\}$ ($j = 1, \cdots, n$) is still a group in A' with
$\{p'_{rj} = 1 - p_{rj}\}$ according to the definition of group.
Hence an algorithm aiming at identifying the groups of
similar connection pattern should suggest the same partition
for both a network and its complementary network.
This is the case for the APBEMA, which is guaranteed by the
symmetry of $1 - A_{ij} \to A'_{ij}$, $1 - p_{rj} \to p'_{rj}$, $\pi_r \to \pi'_r$ and
$q_{ir} \to q'_{ir}$ in Eqs. (3.9) and (3.10). This symmetry
also implies that null links play the same important role
as links in partitioning a network. A further discussion
will be given in Sec. V.
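
The symmetry can be checked numerically with one update step applied to a network and to its complement, starting from the same memberships. The sketch below is self-contained; `em_step`, `complement` and the clipping constant `EPS` are names and choices made for this illustration, and every group is assumed to be non-empty.

```python
import math

EPS = 2.0 ** -40  # assumption: symmetric clipping of p_rj to keep logs finite


def em_step(A, q):
    """Eq. (3.10) followed by Eq. (3.6); returns (p, new_q)."""
    n, c = len(A), len(q[0])
    mass = [sum(q[i][r] for i in range(n)) for r in range(c)]  # assumed positive
    p = [[min(1.0 - EPS, max(EPS, sum(A[i][j] * q[i][r] for i in range(n)) / mass[r]))
          for j in range(n)] for r in range(c)]
    new_q = []
    for i in range(n):
        # ln pi_r + sum_j [A_ij ln p_rj + (1 - A_ij) ln(1 - p_rj)], with pi_r = mass_r / n
        logw = [math.log(mass[r] / n)
                + sum(math.log(p[r][j] if A[i][j] else 1.0 - p[r][j]) for j in range(n))
                for r in range(c)]
        top = max(logw)
        w = [math.exp(x - top) for x in logw]
        new_q.append([x / sum(w) for x in w])
    return p, new_q


def complement(A):
    """Adjacency matrix A' with A'_ij = 1 - A_ij."""
    return [[1 - a for a in row] for row in A]
```

Running `em_step(A, q)` and `em_step(complement(A), q)` should give the same updated memberships and connection patterns related by $p'_{rj} = 1 - p_{rj}$, up to rounding.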

(iii) Applicable to both directed and undirected networks.
Although the APBEMA we obtain here is for
directed networks, it can be extended without any modifications
in form to undirected networks. The argument
is similar to that given in [34]: In an undirected network,
$p_{rj}$ is still the probability for a node in group r to connect
to node j; the probabilities that there is and that there
is no link between node i and node j are $p_{g_i,j} p_{g_j,i}$ and
$(1 - p_{g_i,j})(1 - p_{g_j,i})$, respectively. Hence

$\Pr(A | g, \pi, p) = \prod_{i>j} [p_{g_i,j} p_{g_j,i}]^{A_{ij}} [(1 - p_{g_i,j})(1 - p_{g_j,i})]^{1 - A_{ij}}$, (3.12)

which is the same as Eq. (3.3). ($A_{ij} = A_{ji}$ has been
used.) Other derivations are then exactly the same as in
the directed case.

(iv) Powerful in accounting for the heterogeneity ef- 
fects on grouping. The APBEMA allows us to prescribe 
the involved heterogeneity effects of the outgoing degree 
distribution. This can be done by conveniently intro- 
ducing a tunable parameter to the APBEMA. With this 
extension, we can study how the degree heterogeneity 
may affect the grouping results in a controlled way. In 
the situations where we desire to bias the heterogeneity 
effects on the grouping this extended algorithm would be 
superior. This algorithm will be discussed in detail in
Sec. IV.

(v) Applicable to weighted networks. With a straight- 
forward extension, the APBEMA can also be used to 
analyze weighted networks. A detailed discussion will be 
presented in Sec. V. 

(vi) The same efficiency as the NLA in terms of com- 
putational complexity. 



B. Examples 

To show how well the APBEMA works, we present in
this subsection several typical examples. Just as in the
NLA, besides the adjacency matrix A we also need to
set the number of groups, c, as another input. For all
the examples throughout this paper we assume that this
information is known. In particular, we set c = 2
in all examples except for the case of the American
college football teams, where c = 12 is assumed.

FIG. 2: An example showing that the APBEMA
can identify the groups of similar connection pattern in a
homogeneous network constructed according to the definition
of group. The network contains n = 60 nodes which by
construction are divided into two sets of equal size. In each
set the nodes are randomly connected with the average intra-group
degree $k^{\mathrm{intra}} = 13$, and between the two sets the links
are randomly connected with the average inter-group degree
$k^{\mathrm{inter}}$. The error rate by the APBEMA is shown as a function
of the inter-group degree $k^{\mathrm{inter}}$. The two sets are successfully
recognized for $k^{\mathrm{intra}} > k^{\mathrm{inter}}$ and $k^{\mathrm{intra}} < k^{\mathrm{inter}}$ when the
group structure is clear.

The first example is a homogeneous undirected network.
We simply divide n nodes into two sets of equal
size and in each of them nodes are randomly intra-connected
with the average intra-degree $k^{\mathrm{intra}}$. After
that the inter-group links are randomly added with the
average inter-group degree $k^{\mathrm{inter}}$. Obviously, these two
sets are two groups according to the definition, and when
$k^{\mathrm{intra}} > k^{\mathrm{inter}}$ ($k^{\mathrm{intra}} < k^{\mathrm{inter}}$) they are assortatively
(disassortatively) connected. In practice, the larger the
difference between $k^{\mathrm{inter}}$ and $k^{\mathrm{intra}}$ is, the clearer the
group structure would be, and the easier it should be to
detect the groups.

The results for n = 60, $k^{\mathrm{intra}} = 13$ against $k^{\mathrm{inter}}$ are
summarized in Fig. 2. We find that the APBEMA works
well: it identifies successfully both the assortatively and
disassortatively linked groups when their structures are
clear. If $k^{\mathrm{intra}}$ and $k^{\mathrm{inter}}$ are too close it fails, just as
expected.
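
A test network of this kind could be generated as sketched below. The text does not specify how the random links are drawn, so the sketch assumes independent links with probability $k^{\mathrm{intra}}/(n/2 - 1)$ inside a set and $k^{\mathrm{inter}}/(n/2)$ between the sets, which gives the requested average degrees. The function name `two_group_network` is hypothetical.

```python
import random


def two_group_network(n, k_intra, k_inter, seed=None):
    """Undirected network of two equal sets; returns (A, labels)."""
    rng = random.Random(seed)
    half = n // 2
    p_in = k_intra / (half - 1)   # assumption: independent intra-set links
    p_out = k_inter / half        # assumption: independent inter-set links
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            same_set = (i < half) == (j < half)
            if rng.random() < (p_in if same_set else p_out):
                A[i][j] = A[j][i] = 1
    labels = [0 if i < half else 1 for i in range(n)]
    return A, labels
```

With n = 60 and `k_intra = 13`, varying `k_inter` would scan the horizontal axis of Fig. 2 under these assumptions.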

It is interesting to note that when $k^{\mathrm{intra}} \gg k^{\mathrm{inter}}$ the
two groups can be seen as two communities. This fact
suggests that in the cases when groups and communities
overlap with each other in a network the APBEMA can
be used to detect communities as well. Given this, it
is expected that for $k^{\mathrm{intra}} \ll k^{\mathrm{inter}}$, when the network
becomes bipartite-like, the APBEMA works equally well.
This is because the complementary network in this case
is a community network, and as pointed out
in the last subsection, the APBEMA is symmetric for a
network and its complementary network. Indeed, such a
symmetry has manifested itself clearly on the error rate
curve presented in Fig. 2.

FIG. 3: An example showing that the APBEMA
can identify the groups of similar connection pattern in a
heterogeneous network constructed according to the definition
of group. The error rate (solid dots) is for the group
detection result by the APBEMA in identifying a fully connected
clique of $n_c = 7$ nodes immersed in a randomly connected
background of 63 nodes whose average degree $k^{\mathrm{BG}}$ is
varied for investigating how the error rate depends on it. For
$k^{\mathrm{BG}} < n_c$ the APBEMA works very well (the error rate is
smaller than 10%), and the error rate due to wrongly partitioning
the clique nodes into the background (open squares)
is small and can be neglected. In this case the error rate is
mainly contributed by wrongly partitioning the background
nodes into the clique as a result of fluctuations in building the
network.

To measure the error of group detection, we define the
error rate $\epsilon$ as the sum of the portions of nodes wrongly
partitioned into the opposite group:

$\epsilon = \frac{\delta n_{12}}{n_1} + \frac{\delta n_{21}}{n_2}$, (3.13)

where $n_1$ ($n_2$) is the number of nodes in the first (second)
group and $\delta n_{12}$ ($\delta n_{21}$) the number of nodes belonging to
group 1 (2) but assigned to group 2 (1) by the algorithm.
If the nodes are randomly assigned to each group,
or all nodes are simply regarded as belonging to a single
group, the error rate so defined takes the value one
and implies a complete detection failure. It is zero only
when all the nodes are correctly grouped. To suppress
the fluctuations, for every data point presented in Fig.
2 we have averaged the error rates evaluated over 1000
realizations of the network. We have also checked that
with other definitions of the detection error, for example,
that used in Ref. [H, H3, 5J , which is based on the nor- 
malized mutual information, the results are qualitatively 
the same. This is also the case for all other examples 
throughout this paper where the error rate is evaluated. 
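
The error rate of Eq. (3.13) could be computed as in the sketch below for two groups labelled 0 and 1. Because the group labels returned by an algorithm are arbitrary, the sketch takes the smaller value over the two possible label matchings; this matching rule is an assumption of the sketch and is not stated in the text. The function name `error_rate` is hypothetical.

```python
def error_rate(true_labels, assigned_labels):
    """Eq. (3.13) for two groups labelled 0 and 1, minimized over the label swap."""
    n1 = true_labels.count(0)
    n2 = true_labels.count(1)

    def rate(assigned):
        # delta n_12: group-1 nodes assigned to group 2; delta n_21: the reverse
        d12 = sum(1 for t, a in zip(true_labels, assigned) if t == 0 and a == 1)
        d21 = sum(1 for t, a in zip(true_labels, assigned) if t == 1 and a == 0)
        return d12 / n1 + d21 / n2

    swapped = [1 - a for a in assigned_labels]
    return min(rate(assigned_labels), rate(swapped))
```

A perfect partition gives zero, while putting every node into a single group gives one, in line with the two limiting cases described above.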

In our second example the groups are connected in a
way neither purely assortative nor purely disassortative.
First we build a random homogeneous and undirected
network of n nodes with the average degree $k^{\mathrm{BG}}$, then
we choose from them $n_c \ll n$ nodes randomly and fully
connect them to form a clique. We then have two sets of
nodes: the clique, whose nodes have an average degree
$(n_c - 1) + (1 - n_c/n) k^{\mathrm{BG}}$, and the other consisting of the
remaining nodes, which we call the background, whose nodes
have an average degree $k^{\mathrm{BG}}$. We restrict ourselves to the
case $k^{\mathrm{BG}} \ll n_c$, namely, the degrees of the nodes in the
clique are much larger than those in the background, thus
making the clique stand out from the background.
Hence the network under consideration is in fact highly
heterogeneous. It should be pointed out that in this case
the communities occasionally formed in the background
due to fluctuations [42] can be neglected, and according
to the definition the clique and the background are two
groups since the nodes within each of them share the same connection
pattern that can be appropriately specified in terms
of $\{p_{rj}\}$. Furthermore, this network is neither assortative
nor disassortative; it is not a community network
either because the background nodes are connected between
themselves as densely as they are connected
to the clique nodes.
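
The construction could be sketched as follows. The random background is assumed here to be built by drawing each link independently with probability $k^{\mathrm{BG}}/(n - 1)$, which is one common way to obtain a homogeneous random network with the given average degree; the text does not specify the procedure. The function name `clique_in_background` is hypothetical.

```python
import random


def clique_in_background(n, n_c, k_bg, seed=None):
    """Random undirected background with an embedded clique; returns (A, clique)."""
    rng = random.Random(seed)
    prob = k_bg / (n - 1)  # assumption: independent links with mean degree k_bg
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    # choose n_c nodes at random and connect them fully
    clique = sorted(rng.sample(range(n), n_c))
    for a in clique:
        for b in clique:
            if a != b:
                A[a][b] = 1
    return A, clique
```

With n = 70 and `n_c = 7`, varying `k_bg` would correspond to the setting of Fig. 3 under this assumption.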

In Fig. 3 the partition results by the APBEMA for
n = 70 and $n_c = 7$ are shown against the average degree
of the background nodes, $k^{\mathrm{BG}}$. It can be seen that for
$k^{\mathrm{BG}} \ll n_c$ it gives the correct partition perfectly. In fact,
the APBEMA works well all the way up to $k^{\mathrm{BG}} = n_c$
with the error rate smaller than 10%. As $k^{\mathrm{BG}}$ is increased
further the clique becomes less distinct from the
background, and the fluctuations in the background begin
to play a role. As a result the error rate starts to
increase quickly. Further investigations show that for
$k^{\mathrm{BG}} < n_c$ the detection error due to wrongly partitioning
the clique nodes into the background (open squares
in Fig. 3), namely $\delta n_{12}/n_1$ in Eq. (3.13) (subscript 1 (2)
indicates the clique (background)), is very small and can
be safely neglected. The detection error is mainly contributed
by wrongly partitioning the background nodes
into the clique in certain network realizations due to
fluctuations where the wrongly partitioned background
nodes happen to have a higher degree and more links
connecting to the clique nodes. On average the total
number of the wrongly partitioned nodes (mainly from
the background to the clique) is about 0.11, 0.39, 0.88 and
1.6 for $k^{\mathrm{BG}}$ = 1, 2, 3 and 4 respectively. In this calculation
1000 realizations of the network are again used to
average the error rate.

The network studied in this example could be relevant 
for studying some real networks containing cliques. The 
success of the APBEMA is a good indication of the flexibility
and adaptability of the a priori probability assumption,
and suggests that the APBEMA may find some
unique applications in certain partition problems.

In general, in a community network the nodes in a 
community may not share the same connection pattern. 
In such cases the group partition can be different from 




FIG. 4: The dolphin social network 0, 53 . Nodes denoted 
by solid squares and solid dots represent the two disjoint
subdivisions the network split into during the development of
the network [45] after the departure of a key member SN100
(open dot). The dashed line is the group partition suggested
by the APBEMA corresponding to the largest value of $\mathcal{L}$,
which assigns nodes SN89 and PL to the opposite
subdivision but all other nodes to their own subdivisions.
This is one real network example where the APBEMA can be
used to detect the community structure.



that of the community partition. Such an example will 
be discussed in the next section. However, in the cases 
where they do share the same connection pattern, or ap- 
proximately do, our algorithm can then be used to find 
the community structure. This has been seen in the first 
example (Fig. 2) when the two groups are assortatively 
connected. In the following we show two examples of real
community networks where the partition result given by
our algorithm is in good agreement with the community 
partition. 

The first one is a network of bottlenose dolphin pTj 
living in Doubtful Sound, New Zealand [H, EL Ell which 
is composed of 62 dolphins (nodes) and 159 social ties 
(edges). It was assembled by researchers over a number of
years (Fig. 4). During the course of the investigation of this network,
it split into two disjoint subdivisions [45] of unequal
size (represented by solid squares and solid dots in Fig.
4 respectively) following the departure of a key member
named SN100 (denoted by the open dot in Fig. 4). The
group partition provided by the APBEMA corresponding
to the largest value of $\mathcal{L}$ agrees very well with the natural
splitting except for two nodes named PL and SN89.

The second example is the network of the American
college football teams [46]. The network is a map of the
schedule of Division I games for the 2000 season where
115 nodes represent the teams and 616 edges represent
regular-season games between the two teams they
connect [46]. All 115 teams are organized into 12 conferences,
each of which contains about 8-12 teams. As games are
usually more frequent between members of the same
conference than between members of different conferences,
most conferences can be seen as communities. But
because there are a few conferences whose teams played more
games, or nearly as many games, against teams in other
conferences than against those in their own conference, the network
FIG. 5: The network of the American college football teams
extracted from the schedule of Division I games for the 2000
season [46]. The nodes denoted by the same symbols belong
to the same conference. The grouping result produced by the
APBEMA with assumed group number c = 12 is represented
by the clusters. Stars stand for the members of the "IA Independents"
conference, which are scattered due to their sparser connections
with each other. In this case the groups given by the APBEMA
coincide with the communities very well despite the scattering of
the "IA Independents" conference. This is another example
in addition to the dolphin network (see Fig. 4) where the
APBEMA can be used to detect the community structure.



structure does not reflect the genuine conference structure
perfectly [46].

The partition suggested by the APBEMA is presented in
Fig. 5. (The number of the groups is assumed to be
c = 12 as input.) It can be seen that the suggested group
structure agrees fairly accurately with the conference
structure. In particular, five groups (the top five)
are exactly the same as the corresponding conferences,
without any nodes wrongly assigned to/from other
conferences, and five others have only one or two nodes
assigned to/from other conferences. The most obvious
mismatch lies in the partition of the conference "IA
Independents". Its members, Central Florida, Connecticut,
Navy, Notre Dame and Utah State (denoted by stars in
Fig. 5), are assigned to other groups rather than to their
own. Considering the fact that they have more games in
the conferences they are assigned to than in their own,
this is reasonable and to some extent expected.

To summarize this subsection, the APBEMA performs 
well in identifying various structures in a network. More 
examples and further discussions of the presented ones 
will be given in the following sections. 



IV. EFFECTS OF HETEROGENEITY ON 
GROUPING 

In this section we study how the degree heterogeneity 
may affect the grouping results. Theoretically this prob- 
lem is interesting as it is related to a general issue in 
network study, namely, whether/how two different types 
of topological characteristics are coupled. Obviously, in 
the APBEMA the coupling between the degree distribu- 
tion and the group structure is inherent: The APBEMA 
suggests the grouping based on the connection patterns it 
recognizes, but the connection patterns are in turn evaluated
based on the outgoing degrees. The close relation
between the connection patterns (given by $\{p_{rj}\}$) and the
outgoing degrees, $\{k_i\}$, can be seen clearly in Eq. (3.11).

The next question for our aim here is then how the
APBEMA captures the degree heterogeneity. A key
observation is that the APBEMA models the network in a
coarse-grained way. It uses the groups as the 'patches'
to represent different parts of the network, hence in
effect the network is characterized at two different
levels. At the lower level, namely inside each group, the
APBEMA has assumed that all nodes are identical and
statistically independent. Therefore the structure of a
group, and its degree distribution as well, has been assumed
to be homogeneous. So at this level the heterogeneity is
not captured by the APBEMA, which can be seen as a
simplification adopted by the APBEMA. The difference
between the outgoing degree of a node and its expected
value in a group (i.e. $\sum_j p_{rj}$; see Eq. (3.11)) is treated by
the APBEMA as a result of statistical fluctuations.

However, at the level of groups the APBEMA is flexi- 
ble. It allows the statistical characteristics of the groups 
to vary from group to group so that the local structures of 
the network are given the best matching. Therefore it is 
at this level that the heterogeneity is taken into account 
by the APBEMA. With this understanding we may imag- 
ine that the APBEMA tries to mimic the degree distri- 
bution function with a series of peak-like functions. Each 
peak-like function corresponds to a homogeneous degree 
distribution in a group, and its position represents the 
average outgoing degree of the group. 

Hence if the network is heterogeneous, then the hetero- 
geneity would be characterized by the distances between 
these peaks. A good example is the network studied in 
Fig. 3; its degree distribution function consists of
two narrow peaks representing the clique and the
background. The distance between them tells directly 
how heterogeneous the whole network is. For a more 
general degree distribution function, though it is hard to 
infer all the information of the heterogeneity based on 
the distances between these peak-like functions, they are 
still a good indicator of it. Another (opposite) extreme 
case is for the homogeneous networks, see for example the 
one presented in Fig. 2, where all these peak-like functions 
overlap with each other and the distances between them 
are all zero. 
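
The peak positions discussed above can be computed directly from a membership matrix. The following sketch assumes that the average outgoing degree of group r is the $q_{ir}$-weighted mean of the node degrees; the function name and the five-node toy network are hypothetical and serve only as an illustration.

```python
import numpy as np

def group_mean_out_degree(A, q):
    """Average outgoing degree of each group, weighting node i by q[i, r]."""
    k = A.sum(axis=1)                 # outgoing degree k_i of every node
    return (k @ q) / q.sum(axis=0)    # one value per group r

# Toy example: a 3-node clique (nodes 0-2) with two pendant nodes attached.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0]])
q = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

peaks = group_mean_out_degree(A, q)
print("peak positions:", peaks)
print("distance between peaks:", abs(peaks[0] - peaks[1]))
```

A distance of zero between the peaks corresponds to the homogeneous case, and a large distance to a strongly heterogeneous network.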

What we have learned here implies that if we can
appropriately preset the positions of these peak-like functions,
namely the average outgoing degrees of the groups,
then we can influence the way the APBEMA takes
the heterogeneity effects into account. Our aim in this section is to
develop such an algorithm. For example, if all the average
outgoing degrees are taken to be equal, then we have
in effect completely suppressed the heterogeneity effects.
This extreme case will be discussed
in the first subsection below. The APBEMA
discussed in Sec. III takes into account the heterogeneity
effects as fully as it can, so it stands as the other
extreme. In the second subsection we will discuss how
to introduce a control parameter to build an interpolating
algorithm such that the heterogeneity effects involved
can be tuned continuously between these two extremes.
Then we will show in the third subsection, using the example
of the karate network [38], how the heterogeneity plays its
role in grouping. A comparison with the dolphin network
will reveal an interesting underlying structural difference
between the two networks.



A. The heterogeneity suppressed algorithm (HSA) 

As discussed in Sec. III, $\sum_j p_{rj}$ gives the expected
outgoing degree for a node in group r. If we assume
that all the nodes, regardless of which group they belong
to, have the same expected outgoing degree, then $p_{rj}$
should satisfy

$$\sum_j p_{rj} = \langle d^{out} \rangle, \tag{4.1}$$

where $\langle d^{out} \rangle = \frac{1}{n}\sum_{i,j} A_{ij}$ is the average outgoing degree
over the whole network. With this consideration, we can
build up a grouping algorithm where the effect of
heterogeneity is completely suppressed. First we start from
Eqs. (3.2) and (3.3) and get $\mathcal{L}$ as in Eq. (3.5) and $q_{ir}$ as
in Eq. (3.6), namely,

$$q_{ir} = \frac{\pi_r \prod_j p_{rj}^{A_{ij}} (1 - p_{rj})^{1 - A_{ij}}}{\sum_s \pi_s \prod_j p_{sj}^{A_{ij}} (1 - p_{sj})^{1 - A_{ij}}}, \tag{4.2}$$

again. Then we can get $\pi$ and $p$ with the constraint
$\sum_r \pi_r = 1$ and those imposed by Eq. (4.1) by setting
$f(\pi, p, \alpha, \beta) = \mathcal{L} - \alpha(\sum_r \pi_r - 1) - \sum_r \beta_r(\sum_j p_{rj} - \langle d^{out} \rangle)$
and requiring that the partial derivatives of $f$ with
respect to its variables be zero. Here $\alpha$ and $\beta = \{\beta_r\}$ serve
as Lagrange multipliers of the constraints. It leads to

$$\pi_r = \frac{1}{n} \sum_i q_{ir} \tag{4.3}$$

and

$$p_{rj} = \frac{\beta_r p_{rj}^2 + \sum_i A_{ij} q_{ir}}{n \pi_r + \beta_r} \tag{4.4}$$

with

$$\beta_r = \frac{\langle d^{out} \rangle n \pi_r - \sum_i k_i q_{ir}}{\sum_j p_{rj}^2 - \langle d^{out} \rangle}, \tag{4.5}$$

where $k_i = \sum_j A_{ij}$ is the outgoing degree of node i.
We refer to the algorithm defined by Eqs. (4.2)-(4.5) as the
heterogeneity suppressed algorithm (HSA). As expected,
if we set all $\beta_r$ to zero, then the APBEMA is
retrieved.

Compared with the APBEMA, the change in form of
the HSA caused by $\beta$ makes its implementation different:
Here in fact two cycles of iteration, the outer one and the
inner one, are involved. At each step of the outer cycle,
we update $q$ and $\pi$ via Eqs. (4.2) and (4.3) first, then
we enter the inner cycle given by Eqs. (4.4) and
(4.5), with which the values of $p$ and $\beta$ are iterated till
they converge. Then a whole step of the outer cycle is
finished. The outer cycle is continued till all the values
of $q$, $\pi$, $p$ and $\beta$ become stable. We notice that among
the various ways to perform the inner iteration according to
the equivalent transforms of Eqs. (4.4) and (4.5), the one
given by Eqs. (4.4) and (4.5) is the best: It converges in
all the cases we have tested and the running time is
the shortest. (We find the running time also scales with
n as $O(n^2)$ but is about twice that consumed
by the NLA and APBEMA.)



B. The heterogeneity weighted algorithm (HWA)

Now we have two extreme algorithms at hand: in one
(the APBEMA) the heterogeneity is given full consideration
and in the other (the HSA) it is completely suppressed.
Inspired by the way we construct the HSA, we
realize that an 'interpolating' algorithm bridging the two
extremes can be created by introducing a tunable parameter
w into Eq. (4.1) such that

$$\sum_j p_{rj} = \xi_r(w) = w \langle d_r^{out} \rangle + (1 - w) \langle d^{out} \rangle \tag{4.6}$$

with

$$\langle d_r^{out} \rangle = \frac{\sum_i k_i q_{ir}}{\sum_i q_{ir}}. \tag{4.7}$$



Now $\xi_r(w)$ is the average outgoing degree we impose on
the group r, and the parameter w prescribes the weight
of the heterogeneity. For $w = 0$, $\xi_r(w = 0) = \langle d^{out} \rangle$,
then no difference of the expected outgoing degrees
between the groups is considered; Eq. (4.6) is then reduced
to Eq. (4.1). For $w = 1$, $\xi_r(w = 1) = \langle d_r^{out} \rangle$, which is
exactly the average outgoing degree of group r when the
heterogeneity is fully considered; it is then reduced to
Eq. (3.11). For other values of w ($0 < w < 1$) the average
outgoing degree $\xi_r(w)$ takes the linear interpolating
values between $\xi_r(w = 0)$ and $\xi_r(w = 1)$ as a result.
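
The interpolation of Eq. (4.6) is simple enough to state as a one-line function. The sketch below is illustrative only; the function name and the numerical values of the group averages and of the network-wide average are hypothetical.

```python
import numpy as np

def imposed_degree(w, d_group, d_mean):
    """xi_r(w) = w * <d_r^out> + (1 - w) * <d^out>, as in Eq. (4.6)."""
    return w * np.asarray(d_group, dtype=float) + (1.0 - w) * d_mean

d_group = [8.7, 3.0]   # hypothetical group averages <d_r^out>
d_mean = 3.57          # hypothetical network-wide average <d^out>
for w in (0.0, 0.5, 1.0):
    print(w, imposed_degree(w, d_group, d_mean))
```

At w = 0 every group is assigned the network-wide average, and at w = 1 each group keeps its own average.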

Following the derivations as in the HSA, the solutions of
$\pi$ and $p$ under the constraints $\sum_r \pi_r = 1$ and $\sum_j p_{rj} = \xi_r(w)$
are still given by Eqs. (4.2)-(4.4), but $\beta_r$ now reads

$$\beta_r = \frac{\xi_r(w) n \pi_r - \sum_i k_i q_{ir}}{\sum_j p_{rj}^2 - \xi_r(w)} \tag{4.8}$$

instead. It is easy to show that for $w = 0$ it reduces
to Eq. (4.5) and the HSA is retrieved, and for $w = 1$,
as $\beta_r = 0$, we have the APBEMA again. For $0 < w < 1$
we thus have an intermediate algorithm in between where
only partial effects of heterogeneity are considered, hence
in effect it is a heterogeneity weighted algorithm (HWA).
By changing w one can therefore conveniently adjust the
degree of heterogeneity involved and investigate how it
may affect the grouping results. The numerical
implementation of this algorithm is the same as that of the HSA.
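
The inner cycle can be sketched as follows for fixed memberships q. This is an illustration, not the authors' implementation: it assumes that Eq. (4.4) reads $p_{rj} = (\beta_r p_{rj}^2 + \sum_i A_{ij} q_{ir})/(n\pi_r + \beta_r)$ and that Eq. (4.8) reads $\beta_r = (\xi_r n\pi_r - \sum_i k_i q_{ir})/(\sum_j p_{rj}^2 - \xi_r)$. The function name `hwa_m_step`, the stopping rule and the eight-node toy network are hypothetical, and the update of q in the outer cycle is omitted.

```python
import numpy as np

def hwa_m_step(A, q, w, tol=1e-12, max_iter=1000):
    """Update of pi, p and beta for fixed q and heterogeneity weight w."""
    n = A.shape[0]
    k = A.sum(axis=1)                    # outgoing degrees k_i
    S = q.sum(axis=0)                    # sum_i q_ir = n * pi_r
    K = k @ q                            # sum_i k_i q_ir
    N = q.T @ A                          # N[r, j] = sum_i A_ij q_ir
    xi = w * K / S + (1.0 - w) * A.sum() / n   # imposed degrees, Eq. (4.6)
    p = N / S[:, None]                   # unconstrained starting point
    beta = np.zeros_like(S)
    for _ in range(max_iter):            # inner cycle over p and beta
        beta = (xi * S - K) / ((p ** 2).sum(axis=1) - xi)
        p_new = (beta[:, None] * p ** 2 + N) / (S + beta)[:, None]
        done = np.abs(p_new - p).max() < tol
        p = p_new
        if done:
            break
    return S / n, p, beta

# Hypothetical toy network: two linked hubs (nodes 0, 1), each attached to
# three of six leaves (nodes 2..7) that also form a ring among themselves.
A = np.zeros((8, 8))
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7),
         (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 2)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
q = np.zeros((8, 2))
q[:2, 0] = 1.0      # hubs in group 0
q[2:, 1] = 1.0      # leaves in group 1

for w in (0.0, 0.5, 1.0):
    pi, p, beta = hwa_m_step(A, q, w)
    print(w, p.sum(axis=1), beta)
```

In this sketch each row of p sums to the imposed degree $\xi_r(w)$ after every inner step, and at w = 1 the multipliers vanish so that the unconstrained update is recovered. The division fails if $\sum_j p_{rj}^2 = \xi_r$ or $n\pi_r + \beta_r = 0$, which a practical code would have to guard against.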

As a trivial test this heterogeneity weighted algorithm 
has been applied to the example in Fig. 2. As it is a 
homogeneous network, we can expect that weighting the 
heterogeneity will not produce any effects. Namely, the 
partition results shown in Fig. 2 do not depend on
w. Another trivial test is the clique-background network 
studied in Fig. 3. As in this example the groups are 
characterized by their own average degrees, we may ex- 
pect that suppressing the heterogeneity effects may blur 
the line of distinction of the two groups and hence cause 
a detection deterioration. These conjectures have been 
fully verified by our simulations (the data of which are 
not shown here). 

In the following we will consider some more meaningful 
and inspiring examples. In particular we will apply the 
HWA to two real social networks. Interesting results will 
be discussed in detail. 



C. Analysis of the karate club 

In Ref. [38], Zachary reported an anthropological
study of a karate club in a university. During the de- 
velopment of the club, two groups led by the instructor 
and the president formed gradually and in the end, due 
to the lack of a solution to a dispute, the club split. In 
recent years, the network of this karate club has been 
widely used for testing various community finding techniques,
including the NLA in [34], where it has been found
that the result of the NLA is in good agreement with the 
true splitting. 

Applying our heterogeneity weighted algorithm, we
find that for w = 0, namely when the heterogeneity effects are
completely suppressed, the partition result is the same as
that given by the NLA (Fig. 6(a)). But for w = 1 (Fig.
6(b)), when the heterogeneity effects are fully considered,
it suggests that the dominant nodes (open dots in Fig.
6(b)) belong to one group and the others belong to another
group. Such a result (Fig. 6(b)) is not surprising
because nodes in each group are indeed much more
similar, which agrees better with our definition of group.
For example, nodes in each group have more similar
degrees; they have a similar connection pattern as well: in




FIG. 6: Grouping results for the karate club network in a
university [38] given by the heterogeneity suppressed algorithm
(HSA) (a) and the APBEMA (b) respectively. The two
algorithms correspond to the special cases of w = 0 (a) and
w = 1 (b) of the heterogeneity weighted algorithm (HWA).
The groups are distinguished by different symbols representing
the nodes. The partition in (b) shows that the groups may not
be identical with the communities in a community network.



the dominant group nodes are weakly connected to each
other and serve as the branches of the whole network,
while in the other group nodes are only sparsely connected
among themselves and look like leaves attached
to the dominant group. This partition is also meaningful
in reality: it distinguishes the leaders and coordinators from
the other members. It is important to note that from a
different viewpoint based on information theory [37],
a similar partition result has been obtained (see Fig. 4B
in [37]). This example shows clearly that the groups of
similar components may not be the same as the communities
in a community network. In order to have a better
understanding of the network structure, analysis of both
is necessary.

Now let us look at what happens if the weight of the
heterogeneity is changed. Starting from w = 0, each time
we increase w by a small step $\Delta w$ and then iterate
the stabilized results of $q$, $\pi$ and $p$ obtained at the previous
w until they converge. In this way, we can trace the partition
shown in Fig. 6(a) up to w = 1. Similarly, starting from
w = 1, the partition shown in Fig. 6(b) can be traced
back down to w close to zero. The values of $\mathcal{L}$ evaluated
by Eq. (3.5) that correspond to these two groupings are
presented in Fig. 7. We find that the corresponding
$\mathcal{L}$ value for the partition in Fig. 6(a) changes only very
slightly during this process, whereas that for the partition
in Fig. 6(b) is smaller when w is close to zero,
increases continuously with w, and at $w_c \approx 0.37$
becomes the larger of the two. For $w > w_c$, the fact that the
FIG. 7: Study on how the grouping of the karate club
network [38] depends on the degree heterogeneity by using the
heterogeneity weighted algorithm (HWA). The $\mathcal{L}$ values
corresponding to the two groupings shown in Fig. 6 are presented
as functions of w, the weight of the heterogeneity. They are
two maxima and intersect at $w_c \approx 0.37$. It suggests that when
the heterogeneity effects are suppressed ($w < w_c$) the partition
as in Fig. 6(a) is preferred but when the heterogeneity
effects are more fully considered ($w > w_c$) the partition as
in Fig. 6(b) is recommended instead. It shows that a group
partition can depend strongly on the heterogeneity effects.

partition of Fig. 6(a) can still be traced suggests that
the corresponding value of $\mathcal{L}$ is, though not the global
maximum, still a local maximum. (As both partitions coexist for
our algorithm as maxima of $\mathcal{L}$, we believe that a network
analysis by the expectation maximization method would
be more powerful if local maxima other than
the global maximum are considered in addition.)
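
The tracing procedure described above is a warm-start continuation in w. The sketch below shows its structure only: `trace_partition` is a hypothetical helper, and `toy_solve` replaces the EM iterations with gradient ascent on a hypothetical one-dimensional objective that has two local maxima, so that the crossing of two solution branches can be reproduced without any network data.

```python
import numpy as np

def trace_partition(solve, state, w_start, w_end, steps=100):
    """Follow one solution branch, warm-starting each solve from the last one."""
    history = []
    for w in np.linspace(w_start, w_end, steps + 1):
        state, value = solve(w, state)      # iterate to convergence at this w
        history.append((float(w), float(value)))
    return state, history

def toy_solve(w, x, rate=0.05, iters=500):
    """Gradient ascent on the toy objective f(x) = -(x^2 - 1)^2 + (w - 0.5) x."""
    for _ in range(iters):
        x += rate * (-4.0 * x * (x * x - 1.0) + (w - 0.5))
    return x, -(x * x - 1.0) ** 2 + (w - 0.5) * x

# Trace the left maximum upwards in w and the right maximum downwards in w.
x_left, up = trace_partition(toy_solve, -1.0, 0.0, 1.0, steps=20)
x_right, down = trace_partition(toy_solve, 1.0, 1.0, 0.0, steps=20)
print("left branch at w = 0 and w = 1: ", up[0][1], up[-1][1])
print("right branch at w = 1 and w = 0:", down[0][1], down[-1][1])
```

In the toy objective the two branches exchange their ranking at w = 0.5 by symmetry; in the network problem the role of `toy_solve` would be played by the converged iteration of $q$, $\pi$ and $p$ at each w, and the recorded value by $\mathcal{L}$.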

Fig. 7 shows clearly the important role played by 
the heterogeneity in the definition and detection of the 
groups and communities. In this example we have both 
groups and communities. As they are identical for 
w < w c , that is where our algorithm can be used to 
detect the communities. If we insist that only the solution
corresponding to the global maximum of $\mathcal{L}$ defines
the groups, then they are different from the communities
when $w > w_c$.

On the other hand, as w sets the weight of the het- 
erogeneity to be considered, this tunable algorithm is 
quite flexible and may find some interesting applications 
in practice, in particular in those situations where we 
wish to stress or weaken the effects of the heterogeneity 
on purpose. 

Next let us take the dolphin social network as a
comparison. In Fig. 8 the three largest maxima of the $\mathcal{L}$ value
are shown as functions of the weight of the heterogeneity.
There are no intersections between them. This fact
suggests that we have a unique grouping and that it is
robust to the heterogeneity. This is verified by a careful
investigation, which shows that the partitions corresponding to
these curves indeed do not change with w. The groupings
corresponding to the largest two $\mathcal{L}$ maxima are given by
Fig. 4 and Fig. 9 respectively. A comparison between



FIG. 8: Study on how the grouping of the dolphin network 
S3i 0, EH depends on the degree heterogeneity by using the 
heterogeneity weighted algorithm (HWA). The three largest 
maxima of the $\mathcal{L}$ value against the weight of heterogeneity, w,
are shown. The grouping of the network corresponding to the
top (middle) curve is given in Fig. 4 (Fig. 9). It suggests that
in this example the group structure is insensitive to
the heterogeneity effects.



these two partitions is interesting: the only difference lies
in the node PL. On one hand the small difference between their
$\mathcal{L}$ values may be a sign that our algorithm lacks confidence
in partitioning node PL due to its special role
between the two subdivisions, and on the other hand their
overwhelming agreement may suggest that our algorithm
is quite confident in partitioning all other nodes except
PL. This is consistent with the big gap between the second
and the third maxima of $\mathcal{L}$, which indicates that our
algorithm would prefer to discard any groupings
other than those shown in Fig. 4 and Fig. 9.

These results may be an indication that the natural
subdivisions formed after the splitting of the network are
the only main topological structure from the viewpoint
of group partition in this network. Unlike the karate network,
where different structures may coexist, the dolphin
network lacks a 'core' of dominant nodes around which
the other nodes are organized. This topological difference
may have implications for understanding the different
social behaviors of the two societies.



V. EXTENSION TO THE WEIGHTED 
NETWORKS 

As the expectation maximization algorithms have so
many advantages, it is desirable to extend them to
weighted networks. In fact the Newman and Leicht
scheme favors such an extension. A straightforward
method was suggested in [35], where the weight of each
link was related to its contribution to the $\mathcal{L}$ value.
In this section we discuss this problem based on the
APBEMA, but the derivations are similar and straightforward
for the heterogeneity suppressed and the heterogeneity
weighted algorithms. The radical difference between
our scheme and that in [35] is that in our algorithm
it is the information provided by each entry of the
adjacency matrix that is weighted.

FIG. 9: The dolphin social network [13, [3, 5j| . Same as in
Fig. 4 but the partition represented by the dashed line, given
by both the APBEMA and the HWA, corresponds to the second
maximum of $\mathcal{L}$ (see Fig. 8) instead. In this partition only
the node SN89 is not classified into the natural subdivision it
belongs to [45]. A comparison with the partition corresponding
to the first maximum of $\mathcal{L}$ (see Fig. 4) indicates a special
role node PL may play.

We rewrite Eq. (3.5) in the form of

$$\mathcal{L} = \sum_{i,r} q_{ir} \ln \pi_r + \sum_{i,j} \Big[ A_{ij} \sum_r q_{ir} \ln p_{rj} + (1 - A_{ij}) \sum_r q_{ir} \ln(1 - p_{rj}) \Big] \tag{5.1}$$

from which we can tell that the term between the square
brackets represents the contribution to the $\mathcal{L}$ value given
by $A_{ij}$, namely the information of the connection state
between node i and node j. Obviously, whether $A_{ij} = 1$
or $A_{ij} = 0$, its contribution is equally important and
counts. Hence if we attach a weight $\omega_{ij}$ to the information
provided by $A_{ij}$, then the $\mathcal{L}$ value for the aim of
grouping should naturally be replaced by

$$\mathcal{L}_\omega = \sum_{i,r} q_{ir} \ln \pi_r + \sum_{i,j} \omega_{ij} \Big[ A_{ij} \sum_r q_{ir} \ln p_{rj} + (1 - A_{ij}) \sum_r q_{ir} \ln(1 - p_{rj}) \Big]. \tag{5.2}$$

Next, we assume the right grouping should be the one
that maximizes $\mathcal{L}_\omega$ with the constraint $\sum_r \pi_r = 1$. The
deduction is then the same as in the APBEMA and finally
we have

$$\pi_r = \frac{1}{n} \sum_{i=1}^{n} q_{ir} \tag{5.3}$$

and

$$p_{rj} = \frac{\sum_i \omega_{ij} A_{ij} q_{ir}}{\sum_i \omega_{ij} q_{ir}}, \tag{5.4}$$

where $q_{ir}$ is still given by Eq. (3.6). It is apparent that,
for an unweighted network where $\omega_{ij}$ = const, this
algorithm is reduced to the APBEMA as expected.
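
The weighted update can be sketched in a few lines, assuming Eq. (5.4) reads $p_{rj} = \sum_i \omega_{ij} A_{ij} q_{ir} / \sum_i \omega_{ij} q_{ir}$. The function name, the four-node toy network and the soft memberships are hypothetical; the sketch also assumes that no column of weights vanishes for a whole group, since the denominator would then be zero.

```python
import numpy as np

def weighted_m_step(A, q, omega):
    """pi_r and p_rj of Eqs. (5.3)-(5.4) for fixed q and information weights."""
    n = A.shape[0]
    pi = q.sum(axis=0) / n
    num = q.T @ (omega * A)      # sum_i omega_ij A_ij q_ir
    den = q.T @ omega            # sum_i omega_ij q_ir
    return pi, num / den

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
q = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])

pi, p_uniform = weighted_m_step(A, q, np.ones_like(A))
pi, p_nulls = weighted_m_step(A, q, np.where(A == 0, 0.5, 1.0))
print(p_uniform)
print(p_nulls)
```

With constant weights the unweighted update is recovered. Halving the weights of the null entries leaves the numerator unchanged but reduces the denominator, so the estimated connection probabilities can only increase.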

Similarly, if the constraints of Eq. (4.1) or Eq. (4.6)
are taken into account, we can get the heterogeneity 
suppressed or heterogeneity weighted algorithm for the 
weighted network as well. 

It is important to note that $\omega_{ij}$ is the weight of the
information provided by $A_{ij}$ rather than of the link
between node i and j. (Note that though in calculating $p_{rj}$
(Eq. (5.4)) $\omega_{ij}$ does not count in evaluating the numerator
if $A_{ij} = 0$, it does in evaluating the denominator.) In
other words, even if there is no link between node i and
node j, this piece of information ($A_{ij} = 0$) is equally
important for recognizing the group structure. This result
is consistent with our intuition and experience.

In order to appreciate the implications of this
algorithm, let us take the network studied in Fig. 3 as an
illustration. For the sake of simplicity, we assume that
all the weights take only two values: 1 and $\omega$. Here $\omega$
is a constant used to weight a selected portion of the entries
of the adjacency matrix, with $0 \le \omega \le 1$; it is introduced
to control how much of the information in that portion the
algorithm can use, so that we can investigate how the grouping
results depend on it. We consider the following three
cases: (i) $\omega_{ij} = 1$ for $A_{ij} = 0$ and $\omega_{ij} = \omega$ for $A_{ij} = 1$;
(ii) $\omega_{ij} = 1$ for $A_{ij} = 1$ and $\omega_{ij} = \omega$ for $A_{ij} = 0$; (iii)
$\omega_{ij} = \omega$ if both node i and node j are in the clique and
$\omega_{ij} = 1$ otherwise. For $\omega = 0$, since a crucial part of
the information on the network topology is missing, we may
expect a failure of grouping. As $\omega$ is increased, more and
more information is taken into account, and the grouping
should become more and more accurate. Finally, as $\omega = 1$ is
approached, all the topological information is considered, and
our algorithm should suggest the grouping as perfectly as
the APBEMA does. This conjecture has been well
verified by the simulations. In Fig. 10 the grouping error
rate against $\omega$ is summarized for the case where the
network has $n = 70$ nodes, the clique size is $n_c = 7$ and the
average degree of the background nodes is $k_{BG} = 3$. Each
data point represents the averaged error rate over 1000
realizations of the network.
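
The three weighting schemes can be written down explicitly as weight matrices. The sketch below is an illustration with hypothetical names; it makes one assumption the text does not address, namely that the diagonal entries are treated like any other entry of the adjacency matrix.

```python
import numpy as np

def information_weights(A, case, omega, clique):
    """Weight matrix for cases (i)-(iii); `clique` lists the clique node indices."""
    W = np.ones(A.shape, dtype=float)
    if case == "i":            # down-weight entries that record a link
        W[A == 1] = omega
    elif case == "ii":         # down-weight entries that record a null link
        W[A == 0] = omega
    elif case == "iii":        # down-weight entries between two clique nodes
        W[np.ix_(clique, clique)] = omega
    else:
        raise ValueError("case must be 'i', 'ii' or 'iii'")
    return W

# Toy example: a 3-node clique (nodes 0-2) with one background node attached.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
for case in ("i", "ii", "iii"):
    print(case)
    print(information_weights(A, case, omega=0.3, clique=[0, 1, 2]))
```

Each matrix can then be passed as the weights $\omega_{ij}$ in Eq. (5.4).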

In the first case (solid squares in Fig. 10), the information
associated with the null links is fully considered
but that associated with the links is controlled by $\omega$. For
$\omega = 0$ their contributions are completely ignored; as a
consequence the algorithm 'sees' all the nodes isolated
from each other and classifies them into a single group.
Increasing $\omega$ from zero, even slightly, stops the
algorithm from classifying all the nodes into a single group,
but the error rate is still high. As $\omega$ is increased further,
more and more information on the links is available and
the partition becomes more and more accurate. At the
point $\omega \approx 0.7$, the information seems to
be enough for the algorithm to distinguish the
clique well from the background. This phenomenon is
interesting: it suggests that there is in fact a redundancy in
the information used for partition in the network under
FIG. 10: Error rates for the grouping results suggested by the
weighted APBEMA in identifying a fully connected clique of
$n_c = 7$ nodes immersed in a randomly connected background
of 63 nodes whose average degree is $k_{BG} = 3$. The information
contained in each entry of the adjacency matrix is weighted
by either $\omega$ or 1, and solid squares, open squares and solid
dots represent three different ways of assigning the weights
among the entries, which correspond to the cases (i), (ii) and
(iii) as described in the text. In all the cases,
as $\omega$ is increased the grouping becomes more accurate, which
supports the viewpoint that the information of links and null
links are equally important.

study. 

In the second case (open squares in Fig. 10), the
information associated with the links is fully considered
but that with the null links is tuned by $\omega$. Similarly,
for $\omega = 0$ the algorithm cannot 'see' the null links and
thus the background. All nodes are regarded as being in
one well connected group. This result shows clearly that the
information of the null links is a requisite for a correct
partition. As $\omega$ is increased from zero, the error rate
undergoes an abrupt drop. This is because here we have
many more null links than links, and hence even a small
value of $\omega$ may release much more information than in
case (i). Increasing $\omega$ further improves the
grouping correspondingly, just as expected.

In the last case (solid dots) the weight of the information
associated with the clique is varied instead, but
again we have qualitatively the same result as in the first
two cases. These results are consistent with our
discussions on the weighted APBEMA from the information
perspective.

Weighting the information contained in $\{A_{ij}\}$ can be
more relevant in practice. Constructing a network
representation of a real complex system unavoidably
involves the measurement of the connection state between
any two nodes. In a general case, the measurement does
not generate a definite zero/one output; rather, the errors
and uncertainties are entangled intrinsically. In many
cases, such as in some biological systems, biochemical 
systems and human societies, as the relations between 
the elements can be numerous and of various types on 



one hand, and these relations themselves can be coupled 
with each other on the other hand, the problem of mea- 
surement is even more subtle and difficult. Hence for any 
network abstracted in the end, the evaluations of the con- 
fidence in the measured connection states are important 
and necessary. These evaluations of the confidence are 
the ideal measures of the weights considered here. 



VI. SUMMARY 

In this work we have studied how to detect the groups 
in a complex network that consist of nodes having a
similar connection pattern. Our algorithm is based on
the mixture models and the exploratory analysis sug- 
gested by Newman and Leicht, but significant differences 
exist. In our algorithm the connection pattern is mod- 
elled by the a priori probability assumption instead. The 
main advantages of our algorithm are that (i) It can be 
applied without any restriction on the degree distribu- 
tion; (ii) It possesses the symmetry between the links 
and the null links; (iii) It is flexible in dealing with the 
heterogeneity effects; and (iv) It can be extended to the 
connection information weighted networks. These advan- 
tages have been illustrated by various network examples. 

With our algorithm we have studied the role played 
by the heterogeneity. We find that the grouping result 
may depend on the heterogeneity effects involved. This 
finding suggests that in order to have a thorough knowl- 
edge of the network structure, this dependence should be 
analyzed. For this reason all the groupings found (at var- 
ious values of w, see Sec. IV) are justified. This can be 
seen as an extension to the definition of group formally 
defined at w = 1 when the heterogeneity effects are fully 
considered. 

Based on our analysis, it is natural to extend our algo- 
rithm to the connection information weighted networks. 
This result is a direct implication of our a priori prob- 
ability based group connection pattern model. As the 
connection information weighted networks can be closely 
related to the measurement of networks, we expect our 
extended algorithm may find wide applications. 

Finally, our study has also suggested that groupings
associated with other top maxima of the merit function
($\mathcal{L}$) could be meaningful and useful as well. This may be
a common feature among the expectation maximization
algorithms. How to interpret these groupings is an
interesting and potentially important question that deserves
further investigation.